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{S| Abstract 

i 1 The task of clustering a set of objects based on multiple sources 

i—\ of data arises in several modern applications. We propose an integra- 

tive statistical model that permits a separate clustering of the objects 
for each data source. These separate clusterings adhere loosely to an 
overall consensus clustering, and hence they are not independent. We 

1 describe a computationally scalable Bayesian framework for simultane- 

ous estimation of both the consensus clustering and the source-specific 
clusterings. We demonstrate that this nexible approach is more ro- 

O bust than joint clustering of all data sources, and is more powerful 

than clustering each data source separately. This work is motivated 
by the integrated analysis of heterogeneous biomedical data, and we 
present an application to subtype identification of breast cancer tu- 
mor samples using publicly available data from The Cancer Genome 

CO Atlas. Software is available at http://people.duke.edu/~elll3/ 

T 7^ software.html. 
> 

^ 1 Motivation 

a 

Several fields of research now analyze multi-source data (also called multi- 
modal data), in which multiple heterogeneous datasets describe a common 
set of objects. Each dataset represents a distinct mode of measurement 
or domain. Table 1 gives examples of multi-source data from very diverse 
research areas. 

While the methodology described in this article is broadly applicable, our 
primary motivation is the integrated analysis of heterogeneous biomedical 
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Field 


Object 


Data sources 


Computational biology 


Tissue samples 


Gene expression, microRNA, 
genotype, protein abun- 
dance/activity 


Chemometrics 


Chemicals 


Mass spectra, NMR spectra, 
atomic composition 


Atmospheric Sciences 


Locations 


Temperature, humidity, particle 
concentrations over time 


Text learning 


Documents 


Word frequencies, authors, cited 
documents 



Table 1: Examples of multi-source data. 



data. The diversity of platforms and technologies that are used to collect 
genomic data, in particular, is expanding rapidly. Often multiple types of 
genomic data, measuring various biological components, are collected for a 
commen set of samples. For example, The Cancer Genome Atlas (TCGA) 
is a large-scale collaborative effort to collect and catalog data from several 
genomic technologies. The integrative analysis of data from these disparate 
sources provides a more comprehensive understanding of cancer genetics and 
molecular biology. In Section 6 we present an analysis of RNA expression, 
DNA methylation, microRNA expression, and proteomic data from TCGA 
for a common set of breast cancer tumor samples. 

Separate analyses of each data source may lack power and will not capture 
inter-source associations. At the other extreme, a joint analysis that ignores 
the heterogeneity of the data may not capture important features that are 
specific to each data source. Exploratory methods that simultaneously model 
shared features and features that are specific to each data source have recently 
been developed as flexible alternatives [10, 9, 20, 16]. The demand for such 
integrative methods motivates a very dynamic area of statistics and machine 
learning. 

This article concerns integrative clustering. Clustering is a very widely 
used exploratory tool to identify similar groups of objects (for example, clin- 
ically relevant disease subtypes). Hundreds of general algorithms to perform 
clustering have been proposed in the literature. However, our work is moti- 
vated by the need for an integrative clustering method that is computation- 
ally scalable and robust to the unique features of each data source. 
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2 Related Work 



2.1 Integrative clustering 

Most applications of clustering multi-source data follow one of two general 
approaches: 

1. Clustering of each data source separately, potentially followed by a post 
hoc integration of these separate clusterings. 

2. Combining all data sources to determine a single "joint" clustering. 

Under approach (1) the level of agreement between the separate clusterings 
may be measured by the adjusted rand index [6] or a similar statistic. Fur- 
thermore, consensus clustering can be used to determine an overall partition 
of the objects that agrees the most with the data-specific clusterings. Several 
objective functions and algorithms to perform consensus clustering have been 
proposed (for a survey see Nguyen & Caruana [13]). These methods are most 
commonly used to establish consensus among multiple clustering algorithms, 
or multiple realizations of the same clustering algorithm, on a single dataset. 
Consensus clustering has also been used to integrate multi-source data [4, 1] . 
Such an approach is attractive in that it models source-specific features yet 
still determines an overall clustering, which is often of practical interest. 
However, the two stage process of performing entirely separate clusterings 
followed by post hoc integration limits the power to identify and exploit 
shared structure (see Section 5.2 for an illustration of this phenomenon). 

Approach (2) effectively exploits shared structure, at the expense of failing 
to recognize features that are specific to each data source. Within a model- 
based statistical framework, one can find the clustering that maximizes a joint 
likelihood. Assuming that each source is conditionally independent given 
the clustering, the joint likelihood is the product of the likelihood functions 
for each data source. This approach is used by Kormaksson et al. [8] in 
the context of integrating gene expression and DNA methylation data. The 
iCluster method [18, 11] performs clustering by first fitting a Gaussian latent 
factor model to the joint likelihood; clusters are then determined by K-means 
clustering of the factor scores. Rey & Roth [17] propose a dependency-seeking 
model in which the goal is to find a clustering that accounts for associations 
across the data sources. 

Perhaps most similar to our approach in spirit and motivation is the Mul- 
tiple Dataset Integration (MDI) method [7], which uses a statistical frame- 
work to cluster each data source separately while simultaneously modeling 
dependence between the clusterings. By explicitly modeling dependence, 
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MDI permits borrowing strength across data sources. The fundamental dif- 
ference between MDI and the approach we describe in the following concerns 
how dependence is modeled. Specifically, MDI models the dependence be- 
tween each pair of data sources rather than adherence to an overall clustering. 
We elaborate on this distinction in Sections 3.1 and 5.2. 

2.2 Finite Dirichlet mixture models 

Here we briefly describe the finite Dirichlet mixture model for clustering a 
single dataset, with the purpose of laying the groundwork for the integrative 
model given in Section 3. Given data X n for N objects (n = 1, ...,N), the 
goal is to partition these objects into at most K clusters. Typically X n is 
a mult i- dimensional vector, but we present the model in sufficient generality 
to allow for more complex data structures. Let f(X n \9) define a probability 
model for X n given parameter(s) 9. For example, / may be a Gaussian 
density defined by the mean and variance 9 = (/i, a 2 ). Each X n is drawn 
independently from a mbcture distribution with K components, specified by 
the parameters 9i, . . . ,9%. Let C n G {1, . . . , K} represent the component 
corresponding to X n , and Hk be the probability that an arbitrary object 
belongs to cluster k: 

TTfc = P(C n = k). 

Then, the generative model is 

X n ~ f(-\9k) with probability Hk- 

We further assume that II = (iri, ■ ■ ■ ,kk) has a Dirichlet distribution pa- 
rameterized by a K- dimensional vector /3 of positive reals. This allows some 
7Tfc to be small, and therefore N objects may not represent all K clusters. 
Letting K — y oo gives a Dirichlet process. 

Under a Bayesian framework one can put a prior distribution on II and 
the parameter set = (9i, . . . , 9 k)- Standard computational methods such 
as Gibbs sampling can then be used to approximate the posterior distribution 
for n, 0, and C = (Ci, . . . , C n) (for an overview see Neal [12]). 

3 Integrative model 

We extend the Dirichlet mbcture model to accommodate data from M sources 
Xi, . . . , Xjv;f. Each data source is available for a common set of N objects, 
where X mn represents data m for object n. Each data source requires a prob- 
ability model f m (X n \9 m ) parametrized by 9 m . Under the general framework 
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presented here each X m may have disparate structure. For example X\ n may 
give an image where /i defines the spectral density for a Gaussian random 
field, while X 2n may give a categorical vector where / 2 defines a multivariate 
probability mass function. 

We assume there is a separate clustering of the objects for each data 
source, but that these adhere loosely to an overall clustering. Formally, each 
X mn n = 1, . . . , N is drawn independently from a fC-component mbcture dis- 
tribution specified by the parameters m i, . . . , Q m K- Let L mn G {1, . . . , K} 
represent the component corresponding to X mn . Furthermore, let C n G 
{1, . . . , K} represent the overall mbcture component for object n. The source- 
specific clusterings L m = (L m i, . . . , L m ^) are dependent on the overall clus- 
tering c = (d, c N y. 

P(L mn = k\C n ) = v(k, C n , a m ) 

where a m adjusts the dependence function v. The data X m are independent 
of C conditional on the source-specific clustering L m . Hence, C serves only 
to unify L 1? . . . , h M . The conditional model is 

P{J->mn k\X mn , C n , 9 m k) OC v{k-, Cni Otm) fm{X mn \9 m k) ■ 

Throughout this article, we assume v has the simple form 

i j y-i \ J O^m if Cn L mn , 

is{L mn ,C n ,a m ) - | ^ ot herwise [L) 

where a m G [^,1] controls the adherence of data source m to the overall 
clustering. More simply a m is the probability that L mn = C n . So, if a m = 1 
then h m = C. The a m are estimated from the data together with C and 
Li,...,L m . In practice we estimate each a m separately, or assume that 
«i = ... = «m and hence each data source adheres equally to the overall 
clustering. The latter is favored when m = 2 for identifiability reasons. More 
complex models that permit dependence of the a m 's are also potentially 
useful. 

Let 7Tfc be the probability that an object belongs to the overall cluster k: 

TTk = P(C n = k). 

We assume U = (jri, . . . , -k k) follows a Dirichlet(/9) distribution. The proba- 
bility that an object belongs to a given source-specific cluster follows directly: 

1 — a 

P(L mn = k\U) = 7i k a m + (1 - ttjO^y- (2) 
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Moreover, a simple application of Bayes rule gives the conditional distribution 
ofC: 

M 

P(C n = k\L, II, a) OC 7T fc Y[ v( L mn, k, Q! m ), 

m=l 

where v is defined as in (1). 

Although the number of possible clusters K is the same for L 1; . . . ,h M 
and C, the number of clusters that are actually represented may vary. Gen- 
erally the source-specific clusterings L m will represent more clusters than C, 
rather than vice-versa. This follows from Equation (2) and is illustrated in 
AppendixfefSec2. Intuitively if object n is not allocated to any overall clus- 
ter in data source m (i. e., L mn C) then X mn does not conform well to any 
overall pattern in the data. 



Notation 



N 


Number of objects 


M 


Number of data sources 


K 


Number of clusters 


X m 


Data source m 




Data for object n, source m 


f m 


Probability model for source m 


a 

u mk 


Parameters for f m , cluster k 


Pm 


Prior distribution for mk 


C n 


Overall cluster for object n 


TTfc 


Probability that C n = k 


Lmn 


Cluster specific to X mn 


V 


Dependence function for C n and L mn 


Cm 


Probability that L mn = C n 


Table 2: Notation. 



3.1 Marginal forms 

Integrating over the overall clustering C gives the joint marginal distribution 
of ILi , . . . , Ljvf : 

K M 

Oc) OC ^2^k Y\_ "(^iMm)' (3) 
k=l m=l 
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Under the assumption that a>i = . . . = olm the model simplifies: 

K 

P({L mn = k m }^ =1 \U,a) oc £V fc £/** (4) 

k=l 

where t k is the number of clusters equal to k and U = > 1. This 

marginal form facilitates comparison with the MDI method for dependent 
clustering. In the MDI model 0y > control the strength of association 
between the clusterings Lj and h f 

M 

P({L mn = M™=i|n, oc J] ^r mfcm J] (1 + <M (5) 

m=l {i<j\ki=kj} 

where 7f m k = P(L mn = k). For M = 2 and 7Ti. = 7T2- it is straightforward to 
show that (4) and (5) are functionally equivalent under a parameter substi- 
tution (see Appendix C). There is no such equivalence for M > 2, regardless 
of restrictions on II and $. This is not surprising, as MDI gives a general 
model of pairwise dependence between clusterings rather than a model of 
adherence to an overall clustering. 



4 Computation 

Here we present a general Bayesian framework for estimation of the integra- 
tive clustering model. We employ a Gibbs sampling procedure to approxi- 
mate the posterior distribution for the parameters introduced in Section 3. 
The algorithm is general in that we do not assume any specific form for the 
f m and the parameters 9 m k- We use conjugate prior distributions for a m , II, 
and (if possible) 9 mk . 

• a m ~ TBeta(a m , b m , -^), the Beta(a m , b m ) distribution truncated below 
by jf. By default we choose a m — b m — 1, so that the prior for a m is 
uniformly distributed between and 1. 

• IT ~ Dirichlet(/?o). By default we choose /3q = (1, 1, 1), so that the 
prior for II is uniformly distributed on the Standard (M — l)-simplex. 

• The 9 m k have prior distribution p m . In practice, one should choose 
p m so that sampling from the conditional posterior p m (0 m fc|X m , L m ) is 
feasible. 

Markov chain Monte Carlo (MCMC) proceeds by iteratively sampling from 
the following conditional posterior distributions: 
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• 6 m |X m , L m ~ p m (9 mk \X m , L m ) for k = 1, . . . , K. 

• L m |X m , Q m , a m , C ~ P(k\X mn , C n , 9 mk , a m ) for n = 1, . . . , N, where 

P(k\X mn , C n , m ) oc v(k, C n , a m ) f m (X mn \9 m k) ■ 

• a m |C, L m ~ TBeta(a m + r m , 6 m + — r m , i), where r m is the number 
of samples n satisfying L mn = C n . 

• C|L m , n, a ~ P(fc|II, {L mn , a m }^ =1 ) for n = 1, . . . , JV, where 

M 

P(k\Il,{L 

mm m =l) mrii &m ) 

m=l 

• II|C ~ Dirichlet(/?o + p), where pk is the number of samples allocated 
to cluster k in C. 

This algorithm can be suitably modified under the assumption that a\ = 
. . . = q>m (see Appendix A. 2). 

Each sampling iteration produces a different realization of the cluster- 
ings C,Li, ••• ,L m , and together these samples approximate the posterior 
distribution for the overall and source-specific clusterings. However, a point 
estimate may be desired for each of C, Li, • • • , L m to facilitate interpretation 
of the clusters. In this respect methods that aggregate over the MCMC it- 
erations to produce a single clustering, such as that described in [3], can be 
used. 

It is possible to derive a similar sampling procedure using only the marginal 
form for the source-specific clusterings given in Equation (3). However, the 
overall clustering C is also of interest in most applications. Furthermore, 
incorporating C into the algorithm can actually improve computational ef- 
ficiency dramatically, especially if M is large. As presented, each MCMC 
iteration can be completed in O(MNK) operations. If the full joint marginal 
distribution of L\, . . . ,Lm is used the computational burden increases expo- 
nentially with M (this also presents a bottleneck for the MDI method). 

For each iteration, C n is determined randomly from a distribution that 
gives higher probability to clusters that are prevalent in {L ln , . . . , L mn }. In 
this sense C is determined by a random consensus clustering of the source- 
specific clusterings. Hence, we refer to this approach as Bayesian consensus 
clustering (BCC). BCC differs from traditional consensus clustering in three 
key aspects. 
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1. Both the source-specific clusterings and the consensus clustering are 
modeled in a statistical way that allows for uncertainty in all parame- 
ters. 

2. The source-specific clusterings and the consensus clustering are esti- 
mated simultaneously, rather than in two stages. This permits borrow- 
ing of information across sources for more accurate cluster assignments. 

3. The strength of association to the consensus clustering for each data 
source is learned from the data and accounted for in the model. 

We have developed software for the R environment for statistical comput- 
ing [15] to perform BCC on multivariate continuous data sources (available at 
http://people.duke.edu/~elll3/software.html). Specifically, we use a 
Normal-Gamma conjugate prior distribution to model cluster-specific means 
and variances. See Appendbc A for more details. 



4.1 Choice of K 

In theory the specified maximum number of clusters K can be large, for 
example K = N. The number of clusters realized in C and the L m may still 
be small. However, we find that this is not the case for high-dimensional 
structured data such as that used for the genomics application in Section 6. 
The model tends to select a large number of clusters even if the Dirichlet prior 
concentration parameters /3o are very small. From an exploratory perspective 
we would like to identify a small number of interpretable clusters. For this 
reason Dirichlet process models that allow for an infinite number of clusters 
are also not appropriate. 

We therefore consider an alternative heuristic measure that selects the 
value of K that gives maximum adherence to an overall clustering. The 
adjusted adherence parameters a* m G [0, 1] given by 

# _ Ka m - 1 

a m~ K _ 1 

are computed for each candidate value of K. We then select the K giving 
the highest mean adjusted adherence 




771=1 



In practice, we find that this method selects a small number of clusters that 
reveal shared structure across the data sources. 
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5 Simulation 



Here we present applications of BCC to simple simulated datasets. To the 
best of our knowledge there is no existing method that is directly comparable 
to BCC in terms of the motivating model. Nevertheless, we do illustrate the 
flexibility and potential advantages of BCC over other model-based clustering 
methods for multi-source data. 

5.1 Accuracy of a 

We find that with reasonable signal the a m can generally be estimated with 
accuracy and without substantial bias. To illustrate, we generate simulated 
datasets Xi : 1 X 200 and X 2 : 1 x 200 as follows: 

1. Let C define two clusters, where C n = 1 for n <E {1, . . . , 100} and 
C n = 2 for n E {101, ...,200} 

2. Draw a from a Uniform(0.5, 1) distribution. 

3. For m = 1, 2 and n = 1, . . . , 200, generate L mn G {1, 2} with probabil- 
ities P{L mn = C n ) = a and P(L mn ^ C n ) = 1 - a. 

4. For m = 1,2 draw values X mn from a Normal(1.5, 1) distribution if 
L m n = 1 and from a Normal(— 1.5, 1) distribution if L mn = 2 

We generate 100 realizations of the above simulation, and estimate the 
model via BCC for each realization. We assume a\ = «2 in our estimation 
and use a uniform prior; further computational details are given in Section 4 
of the supplemental article. Figure 1 displays a, the best estimate for both a± 
and a 2 , versus the true a for each realization. The point estimate displayed 
is the mean over MCMC draws, and we also display a 95% credible interval 
based on the 2.5 of 97.5 percentiles of the MCMC draws. The estimated a are 
generally close to a, and the credible interval contains the true value in 91 of 
100 simulations. See Section 4 of the supplemental article for a more detailed 
study, including a simulation illustrating the effect of the prior distribution 
on a. 

5.2 Clustering accuracy 

To illustrate the flexibility and advantages of BCC in terms of clustering 
accuracy, we generate simulated data sources X± and X 2 as in Section 5.1 
but with Normal(l,l) and Normal(-l,l) as our mbrture distributions. Hence, 
the signal distinguishing the two clusters is weak enough so that there is 
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True Vs. Estimated a 




~~ i 1 1 1 1 r 

0.5 0.6 0.7 0.8 0.9 1.0 



True a 

Figure 1: Estimated a vs. true a for 100 randomly generated simulations. 
For each simulation the mean value a is shown with a 95% credible interval. 



substantial overlap within each data source. We generate 100 simulations 
and compare the results for four model-based clustering approaches: 

1. Separate clustering, in which a finite Dirichlet mixture model is used 
to determine a clustering separately for X 1 and X 2 . 

2. Joint clustering, in which a finite Dirichlet mixture model is used to 

X 

determine a single clustering for the concatenated data „ 

3. Dependent clustering, in which we model the pairwise dependence be- 
tween each data source, in the spirit of MDI. 

4. Bayesian consensus clustering. 

The full implementation details for each method are given in Section 5 of the 
supplemental article. 
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We consider the relative error for each model in terms of the average 
number of incorrect cluster assignments: 

Em=l ^2n=l ^-{Lmn 7^ L mn \ 

MN 

where 1 is the indicator function. Note that for joint clustering Li = L2. 
The relative error for each clustering method on M = 3 datasets is shown 
in Figure 2 (top panel). Smooth curves are fit to the data using LOESS [2] 
and display the relative clustering error for each method as a function of a. 
Not surprisingly, joint clustering performs well for a ~ 1 (perfect agreement) 
and separate clustering is superior when a ~ 0.5 (no relationship). BCC and 
dependent clustering learn the level of cluster agreement, and hence serve as 
a flexible bridge between these two extremes. 

The bottom panel of Figure 2 displays the results for a similar simulation 
in which M = 3 data sources are generated. Again, BCC is competitive 
with separate clustering when a ~ 0.5 and joint clustering when a ~ 1. 
Dependent clustering does not perform as well in this example. The pairwise- 
dependence model does not assume an overall clustering and hence it has less 
power to learn the underlying structure for M > 2. 

6 Application to Genomic Data 

We apply BCC to multi-source genomic data on breast cancer tumor sam- 
ples from TCGA. For a common set of 348 tumor samples, our full dataset 
includes 

• RNA expression (GE) data for 645 genes. 

• DNA methylation (ME) data for 574 probes. 

• miRNA expression (miRNA) data for 423 miRNAs. 

• Reverse Phase Protein Array (RPPA) data for 171 proteins. 

These four data sources are measured on different platforms and represent 
different biological components. However, they all represent genomic data 
for the same samples and it is reasonable to expect some shared structure. 
These data are publicly available from the TCGA Data Portal. See http: 
//people . duke . edu/~elll3/sof tware .html for R code to completely re- 
produce the analysis, including instructions on how to download and process 
these data from the TCGA Data Portal. 
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M=2 




True a 




Figure 2: Relative clustering error for 100 simulations with M = 2 (a) and 
M = 3 (b) data sources, shown for separate clustering (blue), joint cluster- 
ing (red), dependent clustering (green) and BCC (black). A LOESS curve 
displays clustering error as a function of a for each method. 

Previously, four comprehensive sample subtypes were identified based on 
a multi-source consensus clustering of these data [1]. These subtypes were 
shown to be clinically relevant, as they may be used for more targeted ther- 
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apies and prognoses. 

We select K — 3 clusters for BCC, based on the heuristic described 
in Section 4.1. Full computational details, as well as charts that illustrate 
mixing over the MCMC draws, are available in Appendbc F. Table 3 shows a 
matching matrix comparing the overall clustering C with the comprehensive 
subtypes defined by TCGA. The two partitions have a significant but weak 
association, and this suggests they may not be driven by the same biological 
signal. Similar tables that compare the source-specific clusterings L mn with 
source-specific subtypes are available in Appendix F. 3. 



Table 3: BCC cluster vs. TCGA comprehensive subtype matching matrbc. 

Figure 3 provides a point-cloud view of each dataset given by a scatter 
plot of the the first two principal components. The overall and source-specific 
cluster index is shown for each sample, as well as a point estimate for the ad- 
herence parameter a (alternatively, clustering heatmaps for each data source 
are given in Appendbc F. 4). The four data sources show different structure, 
and the source-specific clusters are more well distringuished than the overall 
clusters in each plot. However, the overall clusters are clearly represented to 
some degree in all four plots. Hence, the flexible yet integrative approach of 
BCC seems justified for these data. 

7 Conclusions and discussion 

This work was motivated by the perceived need for a general, flexible, and 
computationally scalable approach to clustering multi-source data. We pro- 
pose BCC, which models both an overall clustering and a clustering specific 
to each data source. We view BCC as a form of consensus clustering, with 
advantages over traditional methods in terms of modeling uncertainty and 
the ability to borrow information across sources. 

The BCC model assumes a very simple and general dependence between 
data sources. When an overall clustering is not sought, or when such a 
clustering does not make sense as an assumption, a more general model of 
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GE (a = 0.91) 



ME (a = 0.69) 




Figure 3: PCA plots for each data source. Sample points are colored by 
overall cluster; cluster 1 is black, cluster 2 is rcd, and cluster 3 is blue. 
Symbols indicate source-specific cluster; cluster 1 is cluster 2 is '+', and 
cluster 3 is V . 

cluster dependence (such as MDI) may be more appropriate. Furthermore, 
a context-specific approach may be necessary when more is known about 
the underlying dependence of the data. For example, [14] exploit functional 
covariance models for time-course data to determine overall and time-specific 
clusters. 

Our implementation of BCC assumes the data are normally distributed 
and models cluster-specific mean and variance parameters. It is straightfor- 
ward to extend this approach to more complex clustering models. In particu- 
lar, models that assume clusters exist on a sparse feature set [19] or allow for 
more general covariance structure [5] are growing in popularity. While we fo- 
cus on multi-source biomedical data, the applications of BCC are potentially 
widespread. In addition to multi-source data, BCC may be used to compare 
clusterings from different statistical models for a single homogeneous dataset. 
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A Computational details 
A.l Normal-gamma model 

Here we fiil in the details for a specific case of the Bayesian computational 
framework given in Section4. We assume Xj has a normal-gamma mixture 
distribution with cluster-specific mean and variance. That is, 

X mn \L mn k ~ N (fi mk , S m £.), 

where 

• fi m k is a D m dimensional mean vector, where D m is the dimension of 
data source m. 

• S mfc is &D m xD m diagonal covariance matrbc, S mfc = Diag(cr mfcl , . . . , a mkDm ). 

We use a D m dimensional normal-inverse-gamma prior distribution for Q mk = 
(fimk, s mfc) • That is, 

d m k ~ NT~ (r) m0 , Ao, A m0 , B m0 ), 

where rj m0 , X ,A m0 and B m0 are hyperparameters. It follows that fi m i~ and 
S mfc are given by 

• ~ Gamma(A m0 d, J B m0 d), and 

mkd 

2 

• Vmkd ~ N(rj m0 , ^f^) for d = 1, ... , D m . 

By default we set Ao = 1, and estimate fi m o,A m0 and B m0 from the mean 
and variance of each variable in X m . 

The i'th iteration in the MCMC sampling scheme procedes as follows: 

1. Generate 6™ given {X m ,h m for m = 1,...,M. The posterior 
distribution for 6^ k , k = 1, . . . , K is 

Let N mk be the number of samples allocated to cluster k in L™ -1 ', X mk 
be the sample mean vector for cluster k, and S m k the sample variance 
vector for cluster k. The posterior normal-gamma parameters are 
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. 4 W _ A i n 

_ p(*) p i N mk S m k , ApA f mfc (X mfc — )i m fl) 2 

^mO - fl mO + 2 2(A +iV mfc ) 

2. Generate L# given {X m , 6^, C^" 1 )}, for m = 1, . . . , M. The 
posterior probability that Lmn — k for k — 1, . . . , K is proportional to 

where / m is the multivariate normal density defined by m fc = (/i m fc, £ m fc). 

3. Generate «t' given {C^ -1 ^,L^}, for m = 1,...,M. The posterior 
distribution for «t' is TBeta(a m + r m , b m + iV — r m , i), where r m is the 
number of samples n satisfying Lmn = C n 1 \ 

4. Generate given {Lm,n^ _1 ',a' ! '}. The posterior probability that 
Cn^ = k for k = 1, . . . , K is proportional to 

A/ 
m=l 

5. Generate II^ given C^. The posterior distribution for IT l ) is Dirichlet 
(A) + p) where pk is the number of samples allocated to cluster k in 

By default, we initialize L 1; . . . , by a K-means clustering of each dataset. 
After running the Markov chain for a specified number of iterations (e. g., 
10000), the method described in [3] is used to determine a hard clustering 
for each of C, L 1; . . . , L m . 

A. 2 Assuming equal adherence 

It is straightforward to modify the procedure above under the assumption 
that each data source adheres equally well to the overall clustering C. Rather 
than modeling a 1; . . . ,a m separately, we assume a = «i = . . . = a m . The 
prior for a is a truncated beta distribution: 

a ~ TBeta(a, b, —). 
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The MCMC sampling scheme procedes exactly as above, except that in step 
(3) we need only generate a' 1 ' from the posterior distribution TBeta(a + r, b+ 
N M — t, jf). Here r = Xlm=i r m' wnere the r m are as defined in step (3) 
above. 



B Cluster size illustration 

Recall that n k is the marginal probability that an object belongs to the overall 
cluster k: 

Tik = P(C n = k). 

The probability that an object belongs to a given source-specific cluster is 
then 

1 — a m 

P(L mn = k\U) = 7r k a m + (1 - 7r fc ) — _ ^ ■ 

As a consequence, the size of the source-specific clusters are generally more 
uniform than the size of the overall clusters. In particular, the source-specific 
clusterings L m will generally represent more clusters than C, rather than 
vice-versa. 



Overall clusters 



D 



UUDa 



1 23456789 10 
Cluster 



Source-specific clusters (<x=0.95) 



1 23456789 10 
Cluster 



Source-specific clusters (a=0.75) 



Source-specific clusters (a=0.10) 



1 23456789 10 
Cluster 



□□□□□□□□□□ 
1 23456789 10 
Cluster 



Figure 4: Marginal cluster inclusion probabilities are shown for the overall 
clusters (n) with K = 10 (top-left). The source-specific cluster probabilities 
induced by II are shown for the adherence levels a rn = 0.95,0.75 and 0.10. 
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As an illustration, we set K — 10 and assume the Tk have the skewed 
distribution shown in the top left panel of Figure 4. The marginal cluster 
inclusion probabilities for a given data source depend on its adherence a m 
to the overall clustering. Not surprisingly, for a m close to 1 the inclusion 
probabilities closely resemble those for the overalll clustering (top right panel 
of Figure 4). As a m approaches 4, the inclusion probabilities are more 
uniform (bottom two panels of Figure 4). In particular, clusters that had 
zero probability to occur in C have positive probability to occur in L m . 
Hence, a sample that does not fit any overall pattern in a given data source 
(e. g., an outlier) need not be allocated to an overall cluster. 



C Equivalence of BCC and MDI 

Here we compare the MDI and BCC models for M = 2 data sources, giving 
conditions where the two models are equivalent under a parameter substitu- 
tion. 

Assume a = aii = a 2 , and let U = ^z-. The joint distribution of (L 1? L 2 ) 
under BCC is then 

( TX X U 2 + (1 - TTj) if h = k 2 = 1 
P({L mn = k m }\U, a) oc \ Tri + (1 - n^U 2 if h = k 2 = 2 

( U if h ^ k 2 . 

Assume n = tti. = Tf 2 . and let = <pi 2 in the MDI clustering model. The 
joint distribution of (Li,!^) under MDI is then 

r 7f?(i + 0) if h = k 2 = i 

P({L mn = k m }\fi,<f>)<x l (l-^) 2 (l + 0) if fci = fc 2 = 2 

[ 7Ti(l - 7Ti) if fci ^ fc 2 . 

It is straightforward to verify that the two forms are equivalent under the 
substitutions 

* = ^((i-^ + s)(^ + i^)-i 

and 

= y/(l - Tn)^ 1 + TT!^ 

711 ^{l - tti)C/ + TTif/- 1 + v/(l - n^U- 1 + mU 

There is no such equivalence for M > 2, regardless of restrictions on II 
and $. 
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D Prior comparison for a 



We use a simple simulation to illustrate the effect of the prior distribution 
for a. We generate datasets Xi : 1 x 200 and X2 : 1 x 200 as in Section5.1: 

1. Let C define two clusters, where C n = 1 for n G {1, . . . , 100} and 
C n = 2 for n G {101, . . . , 200} 

2. Draw a from a Uniform (0.5, 1) distribution. 

3. For m = 1,2 and n = 1,...,200, generate L mn G {1,2} so that 
P{L mn = C n ) = a and P(L mn ^ C n ) = 1 - a. 

4. For m = 1,2 draw values X mn from a Normal(1.5, 1) distribution if 
L m n — 1 and from a Normal(— 1.5, 1) distribution if L mn = 2 

We estimate the BCC model under the assumption that a = aci = «2, 
where a has prior distribution TBeta(a, b, ^), for various values of a and b. 
The uniform prior (a = b = 1) gives relatively unbiased results, as illustrated 
in Figure 1 of the main mansuscript. Figure 5 displays the estimated values 
a for alternative choices of a and b. Not surprisingly, for very precise priors 
(large a and b) the a are highly influenced by the prior and are therefore 
inaccurate. However, the a appear to be robust for moderately precise priors. 

E Clustering comparison details 

Here we describe the computational details for the four procedures used in 
the clustering comparison study given in Section 5.2. For each procedure the 
MCMC algorithm ran for 1000 iterations (after 200 iterations of "burn-in"), 
and a hard clustering was determined as in Dahi [3]. 

E.l Separate clustering 

We use a normal-gamma mbcture model to cluster each X m . The marginal 
probability that X mn is allocated to cluster k is 7r mfc , where U m = (7r ml , . . . , 7r mfc ) ~ 
Dirichlet(/3o). We use K = 2 clusters and (3q = (1, 1). The z'th iteration in 
the MCMC sampling scheme procedes as follows: 

1. Generate O™ given {X m , L^ -1 ' 1 }. The posterior distribution for 9® k , 
k = 1, . . . , K is 

/)(«') ATV- 1 (m( i ') \W A W d(»)\ 
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True a 



Truea 




Figure 5: Scatterplots of a versus the true a are shown for various prior 
distributions on a. Each prior is of the form TBeta(a, b, |), and the density 
of each prior is shown below the relevant scatterplot. 

Let N m k be the number of samples allocated to cluster k in L™ *\ X m k 
be the sample mean vector for cluster k, and S m k the sample variance 
vector for cluster k. The posterior normal-gamma parameters are 

/mfe A +iV mi; 

* ^ = ^0 + iV mfc 

_ 4 W _ 4 i n 

• ^mO ~~ ^mO "T 2 

_ E>(0 p I NmkSmk i ^)^ f mfe(^inj;-/'m()) 2 

B ™0 ~~ + 2 2(A +iV mfe ) 

2. Generate Lm given {X m , 9™ , Ilm -1 ''}- The posterior probability that 
Lmn = k for k = 1, . . . , K is proportional to 

^mfc /m {X mn | $ m £, ) , 

where f m is the multivariate normal density defined by 9 m k = (fJ> m k, ^mk) 

3. Generate 11$ given L$. The posterior distribution for U m i =W is 
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Dirichlet (/3q + p m ) where p m k is the number of samples allocated to 
cluster k in Lm • 

E. 2 Joint clustering 

We use a normal-gamma mixture model to cluster the concatenated dataset 

Xi 



X 



X 



The computational details are exactly as in Section E.l, except that we 
perform the algorithm on the joint data X rather than separately for each 

E. 3 Dependent clustering 

For our dependent clustering model we let a mini2 be the probability that 
L min = L m2n , where a mirri2 ~ TBeta(a, b, 4). Hence, we model the clustering 
dependence between each pair of datasets, rather than adherence to an overall 
clustering. The marginal probability that X mn is allocated to cluster k is n m k, 
where Tl m = (7r ml , . . . , 7r mfe ) ~ Dirichlet (/3 ). We use K = 2 clusters,/3 = 
(1,1), and a = b = 1. The z'th iteration in the MCMC sampling scheme 
procedes as follows: 

1. Generate 0^ given {X m ,Lm -1 ^}, as in step (1) of Section A.l. 

2. Generate hf , . . . , L$ given {X, a^, 6«, n^" 1 )}. For n = 1, . . . , N, 
the joint posterior probability for {L mn = k m }^ =1 is proportional to 

M M 

Tri' i ^rnm f ) ' 

m=l m'=m+l 

3. Generate atU given {Lm 17 Lm 2 }, for all pairs {(mi,m 2 ) : m\ = 
l,...,Mandm 2 = (mi + 1 ),..., M} . The posterior distribution 
for ain\m 2 is TBeta(a + r mim2 , b + N — r mim2 , i), where r mim2 is the 
number of samples n satisfying Lmm = Lm 2 n- 

4. Generate Tim given Lm. The posterior distribution for Tim is Dirichlet 
(/3 + Pm) where p m k is the number of samples allocated to cluster k in 

L W 
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E.4 BCC 



We implement BCC as described in Section A, assuming equal adherence. 
We use K = 2 clusters, /3 = (1, 1), and a = b = 1. 

F TCGA application details 

Here we provide more details for the application to heterogenous genomic 
data from TCGA that is presented in Section 6. We focus on the application 
of BCC to GE, ME, miRNA and RPPA data for 348 breast cancer tumor 
samples. For more information on the origin of these data and pre-processing 
details see [1]. 

F.l Choice of K 

We use the heuristic method described in Section 4.1 of the main article 
to choose the number of clusters K. The full BCC model is estimated as 
in Section A, for the potential values K = 2, . . . , 10. We model the a m 
separately, set a m = b m = 1 for each m, and set /3q = (1, . . . , 1). We run 
the MCMC sampling scheme for 10, 000 iterations for each K (the first 2000 
iterations are used as a "burn-in" ). We compute the mean adjusted adherence 
d* K for each K. Figure 6 shows a point estimate for a* K (an average over the 
MCMC iterations), as well as a 95% credible interval based on the MCMC 
iterations, as a function of K. The maximum value is obtained for K = 3 
( a* K = 0.57), although the adherence level is comparable for K = 4 and 
K = 6. 

F. 2 MCMC mixing 

We consider the 10, 000 MCMC draws for K = 3. The draws appear to 
mix well and they converge quickly to a stationary posterior distribution. 
Figure 7 shows the draws for a m , m = 1, ... ,4. These are the estimated 
adherence to the overall clustering for GE, ME, miRNA and RPPA. They 
appear to converge within the first 1000 iterations to an approximately sta- 
tionary distribution. The average values are a = 0.91 for GE, a = 0.69 
for ME, a = 0.56 for miRNA and a = 0.70 for RPPA. Figure 8 shows the 
marginal overall cluster inclusion probabilities iik over the MCMC draws. 
These also quickly converge to a stationary distribution. The average values 
are n 1 = 0.24, vr 2 = 0.28 and tt 3 = 0.48. 
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Figure 6: The mean adjusted adherence a* K is shown after estimating the 
model with K — 2, . . . , 10. A point estimate given by the average of the 
MCMC draws, and a credible interval given by the 2.5 and 97.5 percentiles 
of the MCMC draws, are shown for each K. 

F. 3 Comparison with TCGA subtypes 

Table 3 of the main article compares the overall clusters identified by BCC 
with the four overall subtypes defined by TCGA. TCGA has also defined 
subtypes that are particular to each data source, and we compare the source- 
specific clusterings identified by BCC with the source-specific subtypes de- 
fined by TCGA. Tables 4, 5, 6, and 7 show the cluster vs. subtype matching 
matrix for GE, ME, miRNA and RPPA, respectively. The subtypes for GE 
and RPPA are named by TCGA according to their biological profile. In all 
cases the two sample partitions have a significant association (p-value < 0.01; 
Fisher's exact test). However, these associations are not exceptionally strong, 
suggesting that the two partitions may not be driven by the same structure 
in the data. 
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Figure 7: Values of shown for the data sources GE, ME, miRNA and 

RPPA over the 10,000 MCMC draws. 



GE 


BCC cluster 
12 3 




Basal 


65 1 




HER2 


14 5 23 


TCGA subtype 


Luminal A 


78 76 




Luminal B 


5 76 




Normal 


2 3 



Table 4: BCC cluster vs. TCGA subtype matching matrbc for GE data. The 
GE subtypes are named according to their biological profile. 
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8: Values of n k are shown for k = 1, 2, 3 over the 10,000 MCMC 



ME 


BCC cluster 
12 3 




1 


46 12 




2 


1 88 1 


TCGA subtype 


3 


36 




4 


3 87 




5 


73 1 



Table 5: BCC cluster vs. TCGA subtype matching matrix for ME data. 



F. 4 Heatmaps 

Figures 9, 10, 11, and 12 show heatmaps of the GE, ME, miRNA, and RPPA 
datasets, respectively. The processed data that was used for clustering is 
shown in each case. Samples (columns) are grouped by their relevant source- 
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Table 6: BCC cluster vs. TCGA subtype matching matrix for miRNA data. 





BCC cluster 


RPPA 


12 3 


Basal 


61 1 7 


Her2 


29 1 13 


TCGA subtype Luminal A/B 


1 14 105 


ReacI 


61 2 


ReacII 


7 34 



Table 7: BCC cluster vs. TCGA subtype matching matrix for RPPA data. 
The RPPA subtypes are named according to their biological profile. 



specific cluster. For each heatmap a cluster effect is apparent in a large 
number of variables (rows). 
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